Lab 08 - RcppArmadillo + OpenMP + Slurm

Learning goals

  • How we can use OpenMP to speed up loops.
  • Submit a job (array) to Slurm.

Lab task

We have the following C++ function that needs to be sped up:

#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;

//' Fourth order biweight kernel
// [[Rcpp::export]]
arma::mat K4B_v1(
    arma::mat X,
    arma::mat Y,
    arma::vec h
) {
  
  arma::uword n_n = X.n_rows;
  arma::uword n_m = Y.n_cols;
  arma::mat Nhat(n_n, n_m);
  arma::vec Dhat(n_n);
  arma::mat Yhat(n_n, n_m);
  
  for (arma::uword i = 0; i < n_n; ++i)
  {
    
    const auto xrow_i = X.row(i);
    for (arma::uword j = 0; j < i; ++j)
    {
      
      arma::vec Dji_h = (X.row(j) - xrow_i) / h;
      auto Dji_h2 = arma::pow(Dji_h, 2.0);
      
      double Kji_h = prod(
        (arma::abs(Dji_h) < 1) %
          (1.0 - 3.0 * Dji_h2) %
          arma::pow(1.0 - Dji_h2, 2.0) * 105.0 / 64.0
      );
      
      Dhat(i) += Kji_h;
      Dhat(j) += Kji_h;
      
      Nhat.row(i) += Y.row(j) * Kji_h;
      Nhat.row(j) += Y.row(i) * Kji_h;
      
    }
    
  }
  
  for (size_t i = 0u; i < n_n; ++i)
  {
    if (Dhat(i) != 0)
      Yhat.row(i) = Nhat.row(i)/Dhat(i);
  }
  
  return(Yhat);
  
}

We will use OpenMP to accelerate the function.

RcppArmadillo and OpenMP

RcppArmadillo + OpenMP
=

  • Friendlier than RcppParallel… at least for ‘I-use-Rcpp-but-don’t-actually-know-much-about-C++’ users (like myself!).

  • Must run only ‘Thread-safe’ calls, so calling R within parallel blocks can cause problems (almost all the time).

  • Use arma objects, e.g. arma::mat, arma::vec, etc. Or, if you are used to them std::vector objects as these are thread-safe.

  • Pseudo Random Number Generation is not very straightforward… But C++11 has a nice set of functions that can be used together with OpenMP

  • Need to think about how processors work, cache memory, etc. Otherwise, you could get into trouble… if your code is slower when run in parallel, then you probably are facing false sharing

  • If R crashes… try running R with a debugger (see Section 4.3 in Writing R extensions):

    ~$ R --debugger=valgrind

RcppArmadillo and OpenMP workflow

  1. Tell Rcpp that you need to include that in the compiler:

    #include <omp.h>
    // [[Rcpp::plugins(openmp)]]
  2. Within your function, set the number of cores, e.g

    // Setting the cores
    omp_set_num_threads(cores);

RcppArmadillo and OpenMP workflow

  1. Tell the compiler that you’ll be running a block in parallel with OpenMP

    #pragma omp [directives] [options]
    {
      ...your neat parallel code...
    }

    You’ll need to specify how OMP should handle the data:

    • shared: Default, all threads access the same copy.
    • private: Each thread has its own copy, uninitialized.
    • firstprivate Each thread has its own copy, initialized.
    • lastprivate Each thread has its own copy. The last value used is returned.

    Setting default(none) is a good practice.

  2. Compile!

Ex 5: RcppArmadillo + OpenMP

Our own version of the dist function… but in parallel!

#include <omp.h>
#include <RcppArmadillo.h>
using namespace Rcpp;

// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::plugins(openmp)]]

// [[Rcpp::export]]
arma::mat dist_par(const arma::mat & X, int cores = 1) {
  
  // Some constants
  int N = (int) X.n_rows;
  int K = (int) X.n_cols;
  
  // Output
  arma::mat D(N,N);
  D.zeros(); // Filling with zeros
  
  // Setting the cores
  omp_set_num_threads(cores);
  
#pragma omp parallel for shared(D, N, K, X) default(none)
  for (int i=0; i < N; ++i)
  {
    
    for (int j=0; j < i; ++j)
    {
      
      // Computing the distance
      D.at(i,j) = std::sqrt(
        arma::sum(arma::pow(X.row(i) - X.row(j), 2.0))
      );
      
      // Computing square root
      D.at(j,i) = D.at(i,j);
      
    }
    
  }
    
  // My nice distance matrix
  return D;
  
}

The key lines:

  • #include <omp.h> Includes the OpenMP library in our program.

  • // [[Rcpp::plugins(openmp)]] Notifies Rcpp::sourceCpp we are using OpenMP.

  • omp_set_num_threads(cores) Sets the number of cores.

  • #pragma omp parallel for... Specifies the instruction.

Let’s see how much faster our function is

# Sourcing
Rcpp::sourceCpp("dist_par.cpp")

# Simulating data
set.seed(1231)
K <- 50
n <- 10000
x <- matrix(rnorm(n*K), ncol=K)
# Are we getting the same?
hist(as.matrix(dist(x)) - dist_par(x, 4)) # Only zeros

# Benchmarking!
bm <- microbenchmark::microbenchmark(
  dist(x),                 # stats::dist
  dist_par(x, cores = 1),  # 1 core
  dist_par(x, cores = 4),  # 4 cores
  dist_par(x, cores = 8),  # 8 cores
  times = 1
)

print(bm, unit = "s")
Unit: seconds
                   expr      min       lq     mean   median       uq      max
                dist(x) 2.903584 2.903584 2.903584 2.903584 2.903584 2.903584
 dist_par(x, cores = 1) 2.349448 2.349448 2.349448 2.349448 2.349448 2.349448
 dist_par(x, cores = 4) 1.376363 1.376363 1.376363 1.376363 1.376363 1.376363
 dist_par(x, cores = 8) 1.041904 1.041904 1.041904 1.041904 1.041904 1.041904
 neval
     1
     1
     1
     1
print(bm, unit = "relative")
Unit: relative
                   expr      min       lq     mean   median       uq      max
                dist(x) 2.786805 2.786805 2.786805 2.786805 2.786805 2.786805
 dist_par(x, cores = 1) 2.254956 2.254956 2.254956 2.254956 2.254956 2.254956
 dist_par(x, cores = 4) 1.321007 1.321007 1.321007 1.321007 1.321007 1.321007
 dist_par(x, cores = 8) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
 neval
     1
     1
     1
     1

There’s more to OpenMP. I invite you to go over a set of experiments where I compare different strategies for implementing the same function here: https://github.com/UofUEpiBio/r-parallel-benchmark

Before continuing to the next section. Try to apply what we just learned to include OpenMP in the function K4B_v1. To do so, write a C++ function named K4B_v2 and try it out using the following code:

n <- 500
Y <- matrix(rnorm(n * n), ncol = n)
X <- cbind(rnorm(n))

# Running the benchmark
res <- microbenchmark::microbenchmark(
  K4B_v1(X, Y, 2),
  K4B_v2(X, Y, 2, 4),
  check = "equal",
  times = 10
)

print(res, unit = "relative")
print(res, unit = "ms")
# Unit: relative
#                expr     min       lq     mean   median       uq      max neval
#     K4B_v1(X, Y, 2) 3.32887 3.356954 2.268484 2.412626 1.781527 1.462261    10
#  K4B_v2(X, Y, 2, 4) 1.00000 1.000000 1.000000 1.000000 1.000000 1.000000    10
#  cld
#    b
#   a 
# Unit: milliseconds
#                expr       min        lq     mean    median       uq      max
#     K4B_v1(X, Y, 2) 227.04614 229.69930 231.3196 231.48199 233.8489 234.0478
#  K4B_v2(X, Y, 2, 4)  68.20517  68.42492 101.9710  95.94608 131.2631 160.0588
#  neval cld
#     10   b
#     10  a 

Submitting jobs to Slurm

We need to execute the following Rscript in CHPC1:

library(Rcpp)

# Compiling
Rcpp::sourceCpp("distance.cpp", verbose = TRUE)

id <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID", unset = "0"))
message("We are working in array #", id)

is_array <- ifelse(
  Sys.getenv("SLURM_ARRAY_TASK_ID", unset = "") == "",
  FALSE,
  TRUE
)

# Generating seeds
set.seed(123)

# Need to make sure all arrays have different seeds
if (is_array) {
  seeds <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_COUNT", 1))
  seeds <- sample.int(.Machine$integer.max, seeds)
  set.seed(seeds[id])
}

n <- 500
Y <- matrix(rnorm(n * n), ncol = n)
X <- cbind(rnorm(n))

# Running the benchmark
res <- microbenchmark::microbenchmark(
  K4B_v1(X, Y, 2),
  K4B_v2(X, Y, 2, 4),
  check = "equal", times = 10
)

print(res, unit = "relative")
print(res, unit = "ms")

png(
  ifelse(
    is_array,
    sprintf("openmp-example-%02i.png", id),
    "openmp-exampl.png"
    )
)
boxplot(res, unit = "ms")
dev.off()

Two ways of doing it: Single or multiple jobs (using job arrays.) To submit the job, I have created two bash (slurm) scripts we can submit with the sbatch function. Most of the time, we will be submitting single jobs (time-consuming operations, large memory, etc.) Job arrays are useful when the entire script can be parallelized. Submitting jobs using sbatch is straightforward:

sbatch [your-slurm-script]

Where your-slurm-script should be replaced with the name (and path, if needed) of the Slurm script. To check the status of your submitted jobs, you can use the squeue command, e.g.,

squeue -u[your user id]

Where your user id should be replaced with… your user ID. This is equivalent to typing the following command

squeue -u$(id -u)

Once your job is submitted, the R script should generate a boxplot like the following:

Single job

The distance-job.slurm contains the following lines:

#!/bin/sh
#SBATCH --job-name=rcpp-openmp-array
#SBATCH --account=vegayon-np
#SBATCH --partition=vegayon-shared-np
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=00:05:00
#SBATCH --output=slurmlog.log

# Loading R
module load R/4.2.2

#  Running the script
Rscript --vanilla distance.R

Job arrays

Job arrays are a way to submit the same job multiple times. The benefits of using job arrays are manyfold. One of the key advantages of this type of submission is that a single script can be executed across many nodes; making job arrays a powerful tool.

When job arrays start, Slurm generates a set of environment variables available to the user. One of those is the SLURM_ARRAY_TASK_ID environment variable. With the latter, we can identify which of the n tasks is the job running. In our R script, we do so using the Sys.getenv() function (to which we assign the object id):

# If not in a job array, the default value of id is "0"
id <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID", unset = "0"))

That way, our R script can be instructed to behave differently depending on the array id. If your R script depends on generating random numbers (e.g., a simulation study,) it is paramount to properly deal with RNG seeds. A common way to do so is to set a seed for a baseline job, and then sample individual seeds for each one of the tasks. We do that as follows:

# Generating seeds
set.seed(123)

# Need to make sure all arrays have different seeds
if (is_array) {
  seeds <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_COUNT", 1))
  seeds <- sample.int(.Machine$integer.max, seeds)
  set.seed(seeds[id])
}

The distance-job-array.slurm contains the following lines:

#!/bin/sh
#SBATCH --job-name=rcpp-openmp-array
#SBATCH --account=vegayon-np
#SBATCH --partition=vegayon-shared-np
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=00:05:00
#SBATCH --output=slurmlog%a.log
#SBATCH --array=1-2

# Loading R
module load R/4.2.2

#  Running the script
Rscript --vanilla distance.R

Footnotes

  1. The distance.R R script can be downloaded here.↩︎